Poster: brewster Date: Jan 30, 2005 10:54pm
Forum: toronto Subject: Re: Universal OCR

Suppose there were a tool to go from

djvuxml -> Distributed Proofreaders -> djvuxml

and preserve as many word bounding boxes as possible. Some of them will be difficult or impossible to preserve, so it is not that important to keep them all.

Then we would have a set of word images paired with their Unicode text, or, put another way, a training set for OCR.
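
To make the round trip a little more concrete, here is a rough sketch of how boxes might be carried over once the proofread text comes back. It is only an illustration under assumptions: the words and boxes are taken to be already extracted from the djvuxml into plain Python tuples, the function name carry_over_boxes is made up for this example, and the alignment uses Python's standard difflib rather than anything the Archive or Distributed Proofreaders actually runs.

```python
from difflib import SequenceMatcher


def carry_over_boxes(ocr_words, corrected_words):
    """Pair proofread words with the bounding boxes of the OCR words.

    ocr_words:       list of (text, box) tuples, box = (x1, y1, x2, y2).
                     Hypothetical layout, assumed already parsed from djvuxml.
    corrected_words: list of str, the proofread text split into words.

    Returns a list of (corrected_text, box_or_None). A box is kept only
    when a corrected word lines up one-to-one with an OCR word.
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=corrected_words, autojunk=False)

    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # Unchanged words, or a run of words corrected one-for-one:
            # each corrected word inherits the box of its OCR counterpart.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result.append((corrected_words[j], ocr_words[i][1]))
        elif tag in ("replace", "insert"):
            # Words were split, merged, or added by the proofreader;
            # the box cannot be recovered reliably, so it is dropped.
            for j in range(j1, j2):
                result.append((corrected_words[j], None))
        # tag == "delete": OCR words removed by the proofreader; skip them.
    return result
```

Every pair that comes back with a box is one image-of-word / unicode-word training example; pairs with None are the "difficult or impossible" cases and can simply be left out.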

We have gotten interest from the Machine Learning folks in making a universal OCR engine out of this.

Non-Roman scripts would be particularly interesting, so for those we may need to construct the djvuxml more or less from scratch.

If anyone is interested in this, please let us know by forum post, email, or phoning the archive.

-brewster